Search for: All records

Creators/Authors contains: "Yang, Lishan"


  1. Free, publicly-accessible full text available June 28, 2026
  2. Convolutional neural networks (CNNs) are incorporated into many image-based tasks across a variety of domains, including safety-critical ones such as object classification/detection and lane detection for self-driving cars. These applications have strict safety requirements and must guarantee reliable operation of the neural networks in the presence of soft errors (i.e., transient faults) in DRAM. Standard safety mechanisms (e.g., triplication of data/computation) provide high resilience but introduce intolerable overhead. We perform a detailed characterization and propose an efficient methodology for pinpointing critical weights using an efficient proxy, the Taylor criterion (a hedged sketch of this ranking step appears after this list). Using this characterization, we design Aspis, an efficient software protection scheme that performs selective weight hardening and offers a performance/reliability tradeoff. Aspis provides higher resilience compared to state-of-the-art methods and is integrated into PyTorch as a fully automated library.
  3. Graphics Processing Units (GPUs) are widely deployed and utilized across various computing domains, including cloud and high-performance computing. Considering their extensive usage and increasing popularity, ensuring GPU reliability is crucial. Software-based reliability evaluation methodologies, though fast, often neglect the complex hardware details of modern GPU designs. This oversight can lead to misleading measurements and misguided decisions regarding protection strategies. This paper breaks new ground by conducting an in-depth examination of well-established vulnerability assessment methods for modern GPU architectures, from the microarchitecture all the way to the software layers. It highlights divergences between popular software-based vulnerability evaluation methods and the ground-truth cross-layer evaluation, which persist even under strong protections like triple modular redundancy. Accurate evaluation requires considering the fault distribution from hardware to software (a hedged sketch of the kind of software-level injection under scrutiny appears after this list). Our comprehensive measurements offer valuable insights into the accurate assessment of GPU reliability.
  4. Ceccarelli, Andrea; Trapp, Mario; Bondavalli, Andrea; Bitsch, Friedemann (Ed.)
    Simulation-based Fault Injection (FI) is highly recommended by functional safety standards in the automotive and aerospace domains, in order to "support the argumentation of completeness and correctness of a system architectural design with respect to faults" (ISO 26262). We argue that a library of failure models facilitates this process. Such a library, firstly, supports completeness claims through, e.g., an extensive and systematic collection process. Secondly, we argue that failure model specifications should be executable, so they can be implemented as FI operators within a simulation framework, and parametrizable, so they remain relevant and accurate for different systems. Given the distributed nature of automotive and aerospace development processes, we moreover argue that a data-flow-based definition allows failure models to be applied to black-box components (a hedged sketch of such operators appears after this list). Yet, existing sources for failure models provide fragmented, ambiguous, incomplete, and redundant information, often meeting none of these requirements. We therefore introduce a library of 18 executable and parametrizable failure models collected through a systematic literature survey focusing on automotive and aerospace Cyber-Physical Systems (CPS). To demonstrate the applicability to simulation-based FI, we implement and apply a selection of failure models to a real-world automotive CPS within a state-of-the-art simulation environment, and highlight their impact.
  5. Data center downtime typically centers on IT equipment failure, and storage devices are the most frequently failing components in data centers. We present a comparative study of the hard disk drives (HDDs) and solid-state drives (SSDs) that constitute the typical storage in data centers. Using six years of field data on 100,000 HDDs of different models from the same manufacturer from the Backblaze dataset, and six years of field data on 30,000 SSDs of three models from a Google data center, we characterize the workload conditions that lead to failures. We illustrate that the root failure causes differ from common expectations and remain difficult to discern. For HDDs, we observe that young and old drives do not present many differences in their failures; instead, failures can be distinguished by discriminating drives based on the time spent on head positioning. For SSDs, we observe high levels of infant mortality and characterize the differences between infant and non-infant failures. We develop several machine learning failure prediction models that prove surprisingly accurate, achieving high recall and low false positive rates (a hedged sketch of such a model appears after this list). These models are used beyond simple prediction, as they help us untangle the complex interaction of workload characteristics that lead to failures and identify failure root causes from monitored symptoms.
  6. Mourlas, Costas; Pacheco, Diego; Pandi, Catia (Ed.)
    We present an individual-centric agent-based model and a flexible tool, GeoSpread, for studying and predicting the spread of viruses and diseases in urban settings. Using COVID-19 data collected by the Korea Centers for Disease Control and Prevention (KCDC), we analyze patient and route data of infected people from January 20, 2020, to May 31, 2020, and discover how infection clusters develop as a function of time. This analysis offers a statistical characterization of population mobility and is used to parameterize GeoSpread to capture the spread of the disease (a hedged sketch of the agent-based mechanism appears after this list). We validate simulation predictions from GeoSpread against ground truth, and we evaluate different what-if countermeasure scenarios to illustrate the usefulness and flexibility of the tool for epidemic modeling.
  7.
    As Graphics Processing Units (GPUs) are becoming a de facto solution for accelerating a wide range of applications, their reliable operation is becoming increasingly important. One of the major challenges in the domain of GPU reliability is to accurately measure GPGPU application error resilience. This challenge stems from the fact that a typical GPGPU application spawns a huge number of threads and then utilizes a large amount of potentially unreliable compute and memory resources available on the GPUs. As the number of possible fault locations can be in the billions, evaluating every fault and examining its effect on the application error resilience is impractical. Instead, application resilience is evaluated via extensive fault injection campaigns based on sampling of an extensive fault site space. Typically, the larger the input of the GPGPU application, the longer the experimental campaign. In this work, we devise a methodology, SUGAR (Speeding Up GPGPU Application Resilience Estimation with input sizing), that dramatically speeds up the evaluation of GPGPU application error resilience by judicious input sizing. We show how analyzing a small fraction of the input is sufficient to estimate the application resilience with high accuracy and dramatically reduce the duration of experimentation. Key to our estimation methodology is the discovery of repeating patterns as a function of the input size (a hedged sketch of this extrapolation idea appears after this list). Using the well-established fact that error resilience in GPGPU applications is mostly determined by the dynamic instruction count at the thread level, we identify the patterns that allow us to accurately predict application error resilience for arbitrarily large inputs. For the cases that we examine in this paper, this new resilience estimation mechanism provides significant speedups (up to 1336 times, and 97.0 times on average) while keeping estimation errors to less than 1%.
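For the Aspis abstract (item 2), here is a minimal sketch of ranking weights by the first-order Taylor criterion, |w * dL/dw|, in PyTorch. The model, dummy data, and the 1% hardening budget are illustrative assumptions, not the Aspis library's actual API.

```python
# A minimal sketch (not the Aspis API) of Taylor-criterion weight ranking:
# score each weight by |w * dL/dw| after one backward pass, then pick the
# most critical ones for selective hardening.
import torch
import torch.nn as nn

model = nn.Sequential(nn.Conv2d(3, 8, 3, padding=1), nn.Flatten(),
                      nn.Linear(8 * 32 * 32, 10))
criterion = nn.CrossEntropyLoss()

x = torch.randn(4, 3, 32, 32)           # dummy batch, stand-in for real data
y = torch.randint(0, 10, (4,))
loss = criterion(model(x), y)
loss.backward()                          # populates .grad for every weight

scores = []
for name, p in model.named_parameters():
    if p.grad is None:
        continue
    # First-order Taylor importance: |w * dL/dw|, one score per weight.
    s = (p.detach() * p.grad).abs().flatten()
    scores.append((name, s))

# Harden only the top 1% most critical weights (hypothetical budget);
# hardening itself (e.g., replication/checksums) is out of scope here.
all_scores = torch.cat([s for _, s in scores])
k = max(1, int(0.01 * all_scores.numel()))
threshold = all_scores.topk(k).values.min()
critical = {name: (s >= threshold).nonzero().flatten() for name, s in scores}
```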
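For the GPU vulnerability-assessment abstract (item 3), here is a minimal sketch of the kind of architecture-level bit-flip injection whose accuracy the paper examines. The outcome classes and tolerance are illustrative; the sketch deliberately ignores the hardware-to-software fault distribution the paper argues is required for ground truth.

```python
# A minimal sketch of software-level fault injection: flip one random bit of
# a float32 value and classify the outcome. This models faults at the
# architectural level only, which is exactly the simplification the paper
# shows can mislead reliability estimates.
import random
import struct

def flip_bit(value: float, bit: int) -> float:
    """Flip one bit of a float32 and return the corrupted value."""
    (bits,) = struct.unpack("<I", struct.pack("<f", value))
    (out,) = struct.unpack("<f", struct.pack("<I", bits ^ (1 << bit)))
    return out

def classify(golden: float, faulty: float, tol: float = 1e-6) -> str:
    if faulty != faulty:                  # NaN: stand-in for a detected error
        return "DUE"
    return "masked" if abs(golden - faulty) <= tol else "SDC"

random.seed(0)
golden = 3.14159
tallies = {"masked": 0, "SDC": 0, "DUE": 0}
for _ in range(1000):
    faulty = flip_bit(golden, random.randrange(32))
    tallies[classify(golden, faulty)] += 1
print(tallies)   # a naive architecture-level estimate, not the ground truth
```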
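For the failure-model library abstract (item 4), here is a minimal sketch of executable, parametrizable, data-flow failure models. The operator names and the trace are hypothetical, not the paper's 18 surveyed models.

```python
# A minimal sketch of data-flow failure models: each model is a parameterized
# operator that rewrites a component's output stream, so it can be applied to
# black-box components inside a simulation framework.
from typing import Callable, List

FailureModel = Callable[[List[float]], List[float]]

def stuck_at(value: float, start: int) -> FailureModel:
    """From sample `start` on, the output is stuck at `value`."""
    return lambda sig: sig[:start] + [value] * (len(sig) - start)

def offset(bias: float) -> FailureModel:
    """A constant additive offset on every sample."""
    return lambda sig: [s + bias for s in sig]

def delay(steps: int) -> FailureModel:
    """Samples arrive `steps` ticks late; the earliest repeat sample 0."""
    return lambda sig: [sig[max(0, i - steps)] for i in range(len(sig))]

# Apply a parameterized model to a black-box sensor trace.
trace = [0.0, 0.1, 0.2, 0.3, 0.4]
fi_operator = stuck_at(value=0.2, start=2)
print(fi_operator(trace))   # [0.0, 0.1, 0.2, 0.2, 0.2]
```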
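For the HDD/SSD study (item 5), here is a minimal sketch of a failure-prediction model evaluated by recall and false-positive rate, the metrics the abstract emphasizes. The data are synthetic and the feature names hypothetical, not the Backblaze or Google telemetry.

```python
# A minimal sketch, on synthetic data, of drive-failure prediction: train a
# classifier on telemetry-like features and report recall and false-positive
# rate rather than accuracy alone.
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import confusion_matrix
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)
n = 5000
# Hypothetical features: [age_days, head_positioning_time, realloc_sectors]
X = rng.normal(size=(n, 3))
# Synthetic labels loosely tied to positioning time, echoing the HDD finding.
y = (X[:, 1] + 0.5 * X[:, 2] + rng.normal(scale=0.5, size=n) > 1.5).astype(int)

X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.3, random_state=0)
clf = RandomForestClassifier(n_estimators=100, random_state=0).fit(X_tr, y_tr)

tn, fp, fn, tp = confusion_matrix(y_te, clf.predict(X_te)).ravel()
print(f"recall = {tp / (tp + fn):.2f}, FPR = {fp / (fp + tn):.2f}")
# Feature importances hint at root-cause symptoms, as the study does.
print(dict(zip(["age", "positioning_time", "realloc_sectors"],
               clf.feature_importances_.round(2))))
```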
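For the GeoSpread abstract (item 6), here is a minimal sketch of an individual-centric agent-based spread mechanism: agents visit locations and infection spreads among co-located agents. All parameters are illustrative, not the KCDC-calibrated values used to parameterize the tool.

```python
# A minimal agent-based spread sketch: random daily mobility stands in for
# the route data; transmission occurs among agents sharing a location.
import random

random.seed(1)
N_AGENTS, N_LOCATIONS, DAYS = 500, 50, 60
P_TRANSMIT, RECOVERY_DAYS = 0.05, 14       # invented parameters

status = ["S"] * N_AGENTS                  # S/I/R per agent
days_infected = [0] * N_AGENTS
status[0] = "I"                            # one seed case

for day in range(DAYS):
    # Mobility: each agent visits one location per day.
    by_loc = {}
    for a in range(N_AGENTS):
        by_loc.setdefault(random.randrange(N_LOCATIONS), []).append(a)
    # Transmission among co-located agents.
    for agents in by_loc.values():
        if any(status[a] == "I" for a in agents):
            for a in agents:
                if status[a] == "S" and random.random() < P_TRANSMIT:
                    status[a] = "I"
    # Recovery after a fixed infectious period.
    for a in range(N_AGENTS):
        if status[a] == "I":
            days_infected[a] += 1
            if days_infected[a] >= RECOVERY_DAYS:
                status[a] = "R"
    if day % 10 == 0:
        print(day, "infected:", status.count("I"))
```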
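For the SUGAR abstract (item 7), here is a minimal sketch of the extrapolation idea: if per-thread dynamic instruction counts repeat in a pattern as the input grows, resilience measured on a small input predicts the large-input value. The pattern and masking probabilities below are invented for illustration.

```python
# A minimal sketch of input-sizing-based resilience estimation: resilience
# per instruction-count class is measured once (via fault injection) on a
# small input, then reweighted by the class mix of an arbitrarily large input.
from collections import Counter

def thread_inst_counts(input_size: int) -> list:
    """Stand-in for profiling: a repeating per-thread pattern of dynamic
    instruction counts (e.g., boundary threads do less work)."""
    pattern = [120, 120, 120, 80]          # hypothetical repeating tile
    return [pattern[i % len(pattern)] for i in range(input_size)]

# Masking probability per instruction-count class, as if measured by a fault
# injection campaign on the small input (values invented for the sketch).
masked_prob = {120: 0.91, 80: 0.97}

def estimate_resilience(input_size: int) -> float:
    """Weight each class's masking probability by its thread share."""
    counts = Counter(thread_inst_counts(input_size))
    total = sum(counts.values())
    return sum(masked_prob[c] * n / total for c, n in counts.items())

small = estimate_resilience(64)            # cheap campaign on a small input
large = estimate_resilience(1_000_000)     # predicted, no new injections
print(f"small-input estimate {small:.4f}, large-input prediction {large:.4f}")
```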